Support getting checksums in weight checker #24537
Merged
Conversation
Cover _random_like dtype branches, _postprocess_tensors (non-persistent buffer skip, fp8 quant pair handling for both fp32 and ue8m0-packed scales), _check_tensors error paths, and the WeightChecker class lifecycle (snapshot, reset_tensors, compare, handle dispatch). Real fp8 tensors are constructed via quant_weight_ue8m0/transform_scale_ue8m0; no mocks of fp8 utilities.
Launches a real sgl server and exercises snapshot/compare/reset_tensors plus an unknown-action negative case. Cases that mutate weights are named to sort last so they cannot affect earlier cases sharing the server.
Two new cases on top of the existing snapshot/compare/reset coverage:
- update_weights_from_tensor with a divergent tensor must make compare fail and surface the param name in the error message
- update_weights_from_tensor with byte-identical bytes (prime, snapshot, push the same bytes again) must keep compare passing
Qwen3-0.6B is smaller than the previous Llama-3.2-1B-Instruct default, shortening server launch and pre-fill steps. The fused name gate_up_proj.weight is sglang's actual on-disk parameter name (no HF remapping in the path), so the test exercises the parameter unambiguously.
Matches sglang's inference-time nn.Parameter convention so that _reset_tensors can do in-place copy_ without autograd rejecting it. Fixes: RuntimeError 'a leaf Variable that requires grad is being used in an in-place operation' from the reset / compare-after-reset cases.
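The failure mode and fix can be reproduced in isolation with plain PyTorch (a minimal sketch, not sglang's fixture code):

```python
import torch
import torch.nn as nn

# A leaf Parameter with requires_grad=True rejects in-place mutation
# outside autograd-exempt contexts:
p = nn.Parameter(torch.zeros(4))  # requires_grad defaults to True
try:
    p.copy_(torch.ones(4))
except RuntimeError as err:
    # "a leaf Variable that requires grad is being used in an in-place operation"
    print(type(err).__name__)  # RuntimeError

# The inference-time convention sidesteps autograd entirely,
# so _reset_tensors-style in-place copy_ succeeds:
q = nn.Parameter(torch.zeros(4), requires_grad=False)
q.copy_(torch.ones(4))
print(q.sum().item())  # 4.0
```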
_snapshot's '.detach().cpu()' is a no-op on a CPU tensor, so a CPU-only fixture leaves the snapshot aliasing live storage and masks reset-then-compare divergence. Putting the fixture on CUDA mirrors production (the model is always on the device) and forces _snapshot to produce a real, independent CPU copy.
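The aliasing pitfall is easy to demonstrate with plain PyTorch (a sketch; the variable names are illustrative):

```python
import torch

w = torch.ones(3)          # CPU-only fixture: weight already lives on the host
snap = w.detach().cpu()    # .cpu() is a no-op here, so snap aliases w's storage
w.add_(1.0)                # mutate the "live" weight
print(torch.equal(snap, w))             # True: the snapshot drifted along
print(snap.data_ptr() == w.data_ptr())  # True: same underlying storage

if torch.cuda.is_available():
    g = torch.ones(3, device="cuda")    # production: weight lives on the device
    snap = g.detach().cpu()             # real device-to-host copy
    g.add_(1.0)
    print(torch.equal(snap, g.cpu()))   # False: divergence is now detectable
```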
Sending the fused gate_up_proj.weight name directly trips a name.replace collision in sglang's stacked_params_mapping (gate_up_proj contains the substring up_proj), producing the bogus key gate_gate_up_proj.weight and crashing the model loader. Use the HF unfused alias up_proj.weight with shape (intermediate_size, hidden_size); sglang rewrites it onto the fused tensor with shard_id=1, writing only the up half — sufficient to make compare detect a divergence.
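The substring collision can be sketched without sglang; the mapping entries below only mimic the shape of a stacked_params_mapping table, and the real sglang table differs in detail:

```python
# Illustrative entries: (fused fragment, checkpoint fragment, shard_id).
stacked_params_mapping = [
    ("gate_up_proj", "gate_proj", 0),
    ("gate_up_proj", "up_proj", 1),
]

# Sending the fused on-disk name directly:
name = "model.layers.0.mlp.gate_up_proj.weight"
for fused, ckpt, shard_id in stacked_params_mapping:
    if ckpt in name:  # "up_proj" false-matches inside "gate_up_proj"
        print(name.replace(ckpt, fused))
        # model.layers.0.mlp.gate_gate_up_proj.weight  <- bogus key
        break

# The HF unfused alias avoids the collision and lands on shard_id=1:
name = "model.layers.0.mlp.up_proj.weight"
for fused, ckpt, shard_id in stacked_params_mapping:
    if ckpt in name:
        print(name.replace(ckpt, fused), shard_id)
        # model.layers.0.mlp.gate_up_proj.weight 1
        break
```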
Hoists the per-callsite skip-pattern lists in _reset_tensors and _postprocess_tensors to a single module-level _NON_PERSISTENT_BUFFER_PATTERNS tuple, accessed through _is_non_persistent_buffer_name. The unified set is the union of the two prior lists: cos_sin_cache, inv_freq, freqs_cis, _weight_fp32 — both callsites skip the same buffers now (previously _reset_tensors was missing inv_freq and _postprocess_tensors was missing freqs_cis).
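A sketch of the hoisted tuple and predicate, using the names from the description above (the exact sglang code may differ):

```python
_NON_PERSISTENT_BUFFER_PATTERNS = (
    "cos_sin_cache",
    "inv_freq",
    "freqs_cis",
    "_weight_fp32",
)

def _is_non_persistent_buffer_name(name: str) -> bool:
    """True when the tensor name matches a known non-persistent buffer."""
    return any(pattern in name for pattern in _NON_PERSISTENT_BUFFER_PATTERNS)

# Both _reset_tensors and _postprocess_tensors can now share one predicate:
print(_is_non_persistent_buffer_name("model.layers.0.rotary_emb.inv_freq"))  # True
print(_is_non_persistent_buffer_name("model.layers.0.mlp.up_proj.weight"))   # False
```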
Adds an action='checksum' route to WeightChecker.handle that returns a dict produced by pydantic ChecksumInfo, containing per-tensor hashes (hex of tensor_hash from mm_utils, GPU-accelerated via the existing gpu_tensor_hash triton kernel) plus this rank's ParallelismInfo (tp/dp/pp coordinates + global rank/size from torch.distributed). The computation reuses _postprocess_tensors so fp8 weights are dequantized to bf16 before hashing — two (qweight, scale) pairs that dequant to the same bf16 produce the same checksum, matching the semantics of the existing snapshot/compare path. Surrounded by torch.cuda.synchronize() and timed via logger.info so callers can observe per-rank duration. handle() now returns Optional[Dict] — None for snapshot/reset/compare and the dict payload for checksum.
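A pure-Python stand-in for the checksum payload shape (the real route hashes on GPU via the gpu_tensor_hash triton kernel and emits a pydantic ChecksumInfo; the field names and hashing here are illustrative only):

```python
import hashlib
import struct

def hash_weights(named_weights):
    """Map each parameter name to a hex digest of its float values."""
    return {
        name: hashlib.sha256(struct.pack(f"{len(vals)}f", *vals)).hexdigest()
        for name, vals in named_weights
    }

def checksum_payload(named_weights, rank=0, world_size=1):
    # One rank's payload: per-tensor hex hashes plus parallelism coordinates.
    return {
        "tensor_hashes": hash_weights(named_weights),
        "parallelism": {"rank": rank, "world_size": world_size},
    }

payload = checksum_payload([("mlp.up_proj.weight", [1.0, 2.0, 3.0])])
print(sorted(payload))                                       # ['parallelism', 'tensor_hashes']
print(len(payload["tensor_hashes"]["mlp.up_proj.weight"]))   # 64 (hex sha256)
```

Because hashing happens after dequantization in the real path, two (qweight, scale) pairs that dequant to the same bf16 values map to identical digests, just as two identical float lists do here.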
Plumbs the optional dict returned by WeightChecker.handle from model_runner up to the /weights_checker HTTP body:
- model_runner.check_weights now returns the underlying handle() value.
- CheckWeightsReqOutput gains an optional payload: Dict carrying one rank's ChecksumInfo dict.
- Scheduler's check_weights captures payload into the output.
- TokenizerManager.check_weights now returns (success, message, ranks) where ranks is the per-rank list collected naively in fan-out order (None when no rank produced a payload). FanOutCommunicator.merge_results is left untouched so the 11+ existing 2-tuple callers keep working.
- /weights_checker HTTP body adds a top-level 'ranks' key when present, preserving the prior 'success'/'message' shape for the snapshot, reset, and compare actions.
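A minimal sketch of the new dispatch contract (the method bodies are placeholders, not sglang's implementation):

```python
from typing import Any, Dict, Optional

class WeightChecker:
    """Illustrative handle() contract: None for side-effect actions,
    a dict payload only for the new checksum action."""

    def handle(self, action: str) -> Optional[Dict[str, Any]]:
        if action in ("snapshot", "reset_tensors", "compare"):
            return None  # side effect only; HTTP body keeps success/message shape
        if action == "checksum":
            # One rank's payload; real code fills per-tensor hashes and
            # ParallelismInfo coordinates.
            return {"tensor_hashes": {}, "parallelism": {"rank": 0}}
        raise ValueError(f"unknown action: {action}")

checker = WeightChecker()
print(checker.handle("snapshot"))                  # None
print(checker.handle("checksum") is not None)      # True
```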
Adds unit coverage for ChecksumInfo / ParallelismInfo / _is_non_persistent_buffer_name / _hash_tensor / _compute_checksum: hash stability and hex format, parallelism info reflection, post-mutation hash drift, and round-trip through the strict pydantic schema. Extends TestHandle to cover the new 'checksum' route. The e2e test gains four cases on the shared engine: response shape, two-call stability, hash drift after update_weights_from_tensor, and absence of non-persistent buffer names in the checksum keys.
Collaborator (Author)
/tag-and-rerun-ci
ltcs11 added a commit to ltcs11/sglang that referenced this pull request on May 7, 2026
* main: (894 commits)
  [Bug Fix] Fix RunAI streamer: corrupted weights, missing quant init, and broken URIs for multimodal models (sgl-project#22715)
  [Kernel] Deprecate DeepGemm in sgl kernel and apply custom wheel sgl-deep-gemm (sgl-project#24268)
  propagate pytest exit code from test __main__ entries (sgl-project#24487)
  [R3] Avoid implicit CUDA sync in routed experts DP slicing (sgl-project#24550)
  Add ChatCompletionRequest-style support to /v1/tokenize (sgl-project#23981)
  Support Triton MLA FP8 KV cache (sgl-project#20479)
  [diffusion] chore: align LTX-2 with official (sgl-project#24313)
  Expand support matrix for pypi wheel release (sgl-project#24565)
  [codex] Optimize Z-Image packed QKV (sgl-project#24117)
  [Misc] Fix breaking weight checker test (sgl-project#24553)
  [LoRA] Fix qkv_proj LoRA buffer sizing when tp_size > num_key_value_heads (sgl-project#24420)
  ci: bump test_mimo_models.py est_time 330 → 610 (sgl-project#24551)
  [CI] Temporarily disable marco/mcdse-2b-v1 in test_embedding_models (sgl-project#24279)
  Improve metrics, observability, and PD deploy tooling (sgl-project#24521)
  Fix diffusion fallback guards and validation (sgl-project#23335)
  [PD] Prevent update_status to Failed from cleared entries (sgl-project#24539)
  [CP] Register KV cache allgather buffer with symmetric memory (sgl-project#24040)
  Support getting checksums in weight checker (sgl-project#24537)
  Refactor buffer patterns in weight checker (sgl-project#24538)
  Add unit and end-to-end tests for weight checker (sgl-project#24536)
  ...

# Conflicts:
#	python/sglang/srt/managers/scheduler.py
#	python/sglang/srt/model_executor/model_runner.py
LLThomas pushed a commit to LLThomas/sglang that referenced this pull request on May 8, 2026
LucQueen pushed a commit to LucQueen/sglang that referenced this pull request on May 12, 2026
Motivation
Modifications
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci